Reproducible research

The master version of this document is on GitHub. This presentation was built from commit 4c4d884 of that repository.

sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS  10.13.6
## 
## locale:
## [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2         lubridate_1.7.3      dbplyr_1.2.1        
##  [4] dplyr_0.7.4.9000     RSQLite_2.0          hexbin_1.27.1       
##  [7] GenomicRanges_1.24.3 GenomeInfoDb_1.8.7   IRanges_2.6.1       
## [10] S4Vectors_0.10.3     BiocGenerics_0.18.0  RLinuxModules_0.2   
## [13] lattice_0.20-35      ggplot2_2.2.1        iptools_0.4.0       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.14         XVector_0.12.1       pillar_1.2.1        
##  [4] plyr_1.8.4           bindr_0.1.1          zlibbioc_1.18.0     
##  [7] pfrticles_0.2        AsioHeaders_1.11.0-1 tools_3.3.3         
## [10] bit_1.1-12           digest_0.6.15        memoise_1.1.0       
## [13] evaluate_0.10.1      tibble_1.4.2         gtable_0.2.0        
## [16] pkgconfig_2.0.1      rlang_0.2.0          DBI_0.8             
## [19] yaml_2.1.18          stringr_1.3.0        knitr_1.20          
## [22] tidyselect_0.2.4     bit64_0.9-7          rprojroot_1.3-2     
## [25] grid_3.3.3           glue_1.2.0           R6_2.2.2            
## [28] rmarkdown_1.9        purrr_0.2.4          blob_1.1.0          
## [31] magrittr_1.5         backports_1.1.2      scales_0.5.0        
## [34] ISOcodes_2017.09.27  htmltools_0.3.6      assertthat_0.2.0    
## [37] colorspace_1.3-2     stringi_1.1.7        lazyeval_0.2.1      
## [40] munsell_0.4.3

Powerplant computing resources

A Linux compute resource within PFR, with a first-in, first-out queuing system.

The OpenLava scheduler is used by staff to allocate computational jobs to the available resources.

## Number of available processors
$ bhosts | perl -lane '$a+=$F[3]; END{print $a}'
756

The scheduler allocates and records compute resources for each submitted job. When submitting any job to the scheduler, a user should know in advance approximately what resource is required and, more importantly, how long the job should take.

Dynamic monitoring

Do we need to monitor scheduler usage?

Yes

Why might this be useful?

Identifying usage issues, decisions around resourcing

We cannot address and fix issues unless we adequately monitor compute resource usage.

Analysis of scheduler

Log file analysis

We have written R code, with a test suite, to analyse the scheduler logs and generate dynamic reports.


Types of submitted jobs

Scheduled jobs can be classified into three categories

  • Pending jobs which were cancelled before running.
  • Run jobs with an error status, a failed job.
  • Run jobs with no error status, a successful job.

Summarizing the analysis of 5 years of scheduler usage logs

  • A third of the CPU time is wasted.
  • Many users are submitting inefficient jobs with excessive run times.

Workflows using uncompressed data commonly cause disk thrashing.

Processing the log file

A Perl script processes 5 years of OpenLava scheduler logs (the lsb.acct file) and writes them to an SQLite database.

knitr::kable(head(cpu_table, n=3))
| ev_datetime | username | cpu_time | wallclock | percent | pend_time | command |
|:--|:--|--:|--:|--:|--:|:--|
| 2016-05-06 12:00:17 | cfltxm | 0.167924 | 2 | 8.3962 | 2 | bash runMe.sh |
| 2016-05-06 12:01:26 | cfltxm | 0.019910 | 1 | 1.9910 | 5 | ./runMe.sh |
| 2016-05-20 11:44:59 | hraaxt | 4900.799329 | 1363 | 359.5597 | 5 | /software/bioinformatics/FastQC-0.11.2/fastqc -t 4 -o 03.fastQC 00.RawData/green_a_C9HBWANXX_GAGTGG_L007_R1.fastq.gz 00.RawData/green_a_C9HBWANXX_GAGTGG_L007_R2.fastq.gz 00.RawData/green_b_C9HBWANXX_ACTGAT_L007_R1.fastq.gz 00.RawData/green_b_C9HBWANXX_ACT |
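The cpu_table above is read back from that SQLite database. A self-contained sketch of the load path, using an in-memory database in place of the real file (the table and column names here are assumptions):

```r
## Sketch: the Perl script writes rows to SQLite; R reads them back with
## RSQLite. An in-memory database stands in for the real file, and the
## table/column names are illustrative assumptions.
library(RSQLite)
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "cpu", data.frame(username  = "cfltxm",
                                    cpu_time  = 0.167924,
                                    wallclock = 2))
cpu_demo <- dbGetQuery(con, "SELECT * FROM cpu WHERE wallclock > 1")
dbDisconnect(con)
cpu_demo
```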

Data fields

  • CPU time is the amount of central processing time taken (user time + system time)
  • Wallclock time is the actual time taken to run
  • Percent = 100 x cpu_time / wallclock
  • Pending time is the amount of time between submitting a job and it starting to run.
knitr::kable(head(cpu_table %>% select(-command), n=1))
| ev_datetime | username | cpu_time | wallclock | percent | pend_time |
|:--|:--|--:|--:|--:|--:|
| 2016-05-06 12:00:17 | cfltxm | 0.167924 | 2 | 8.3962 | 2 |
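As a sanity check, the percent field for the row above can be recomputed directly from its definition:

```r
## percent = 100 x cpu_time / wallclock, for the runMe.sh row above
cpu_time  <- 0.167924
wallclock <- 2
100 * cpu_time / wallclock
## [1] 8.3962
```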

If CPU time is much less than wallclock time there could be a problem.

Two common causes:

  • Excessive memory usage
  • Excessive I/O - data transfer between the hard disk drive and RAM
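Both patterns show up in the percent column: a value well above 100 means the job used several cores (cpu_time sums over cores, as in the fastqc -t 4 row earlier), while a value far below 100 means the job spent most of its wallclock time waiting rather than computing. A small sketch:

```r
## percent > 100: multi-core job (cpu_time counts all cores)
## percent << 100: job mostly waited on memory or I/O
percent <- function(cpu_time, wallclock) 100 * cpu_time / wallclock
percent(4900.799329, 1363)  # fastqc -t 4 row: ~360%, several cores busy
percent(0.167924, 2)        # runMe.sh row: ~8.4%, mostly waiting
```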

Additional User and Team information

A Perl script dynamically processes Active Directory information.

With knowledge about Users and Teams we can further categorize usage

knitr::kable(teams                           %>%
  filter(team_name == "Bioinformatics Team") %>% 
  head(n=3))
| login | fullname | team_name |
|:--|:--|:--|
| cflaxl | Ashley Lu | Bioinformatics Team |
| cflcyd | Charles David | Bioinformatics Team |
| cflsjt | Susan Thomson | Bioinformatics Team |

The team listing contains 8/16 errors:

"Ashley Lu" and "Helge Dzierzon" have left; "Haipeng Zhang" and "Jian Guo" are visitors; "Shakira Johnson" is not in the team; "Ali Saei", "Marcus Davy", and "Simon Deroles" are missing.

Bioinformatics Team information

## Add Bioinformatics
teams$group <- bioinf_team(teams, power_users = TRUE)
teams$group <- shinyClimateChange_team(teams, group = teams$group, power_users = TRUE)
knitr::kable(head(teams))
| login | fullname | team_name | group |
|:--|:--|:--|:--|
| cfaavm | Ala Mohan | Postharvest Fresh Foods Team | Other |
| cflasc | Andrew Catanach | Genomics Team | Bioinformatics |
| cflaxl | Ashley Lu | Bioinformatics Team | Other |
| cflbrv | Bhanupratap Vanga | Pathology & Appl Mycology Team | Other |
| cflbxd | Brett Davis | Breeding Technologies Team | Other |
| cflcyd | Charles David | Bioinformatics Team | Bioinformatics |
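bioinf_team() and shinyClimateChange_team() are internal helpers; a minimal sketch of the idea, assuming each simply maps a curated list of logins to a group label (the logins below are an illustrative subset, not the real membership):

```r
## Hypothetical sketch of bioinf_team(): label a curated set of logins
## "Bioinformatics" and everyone else "Other"
bioinf_logins <- c("cflasc", "cflcyd")   # illustrative subset only
bioinf_team <- function(teams, logins = bioinf_logins) {
  ifelse(teams$login %in% logins, "Bioinformatics", "Other")
}
demo <- data.frame(login = c("cfaavm", "cflasc", "cflcyd"))
bioinf_team(demo)
```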

Executed job numbers

All run jobs:

| count | earliest | latest | duration |
|--:|:--|:--|:--|
| 2556473 | 2012-10-16 11:12:58 | 2018-10-02 13:13:32 | 188100034s (~5.96 years) |

There are 2556473 executed jobs over 5.96 years.

All failed jobs:

| count | earliest | latest | duration |
|--:|:--|:--|:--|
| 722436 | 2012-10-16 11:12:58 | 2018-10-02 12:47:40 | 188098482s (~5.96 years) |

All successful jobs:

| count | earliest | latest | duration |
|--:|:--|:--|:--|
| 1831729 | 2012-10-16 12:04:21 | 2018-10-02 13:13:32 | 188096951s (~5.96 years) |

28.3% of jobs failed
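The failure rate follows directly from the counts above:

```r
## Failure rate from the run and failed job counts above
failed <- 722436
total  <- 2556473
round(100 * failed / total, 1)
## [1] 28.3
```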

Executed job durations

Summarizing job execution durations

totals <- sum_cpu_time(cpu %>% filter(cpu_time>0))
knitr::kable(totals)
| all | success | failed |
|:--|:--|:--|
| 11608914676.7243s (~367.86 years) | 8782821888.15199s (~278.31 years) | 2826092788.57458s (~89.55 years) |

The total CPU time used by all users is 11608914676.7243s (~367.86 years).

This breaks down to 8782821888.15199s (~278.31 years) of successful jobs and 2826092788.57458s (~89.55 years) of failed jobs.
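These human-readable durations can be reproduced by hand, assuming lubridate's convention of a 365.25-day year:

```r
## Reproduce "~367.86 years" from the total CPU seconds,
## using a 365.25-day year (31557600 s), as lubridate does
secs_per_year <- 365.25 * 24 * 60 * 60
total_cpu     <- 11608914676.7243
round(total_cpu / secs_per_year, 2)
## [1] 367.86
```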

The goal is zero years of failed jobs

Per user summaries

Can summarize by username or team

per_user <- summarise_user_usage(cpu)
knitr::kable(head(anonymize(per_user)))
| username | total_cpu | total_wall | total_jobs | earliest | latest | powerplant_age |
|:--|--:|--:|--:|:--|:--|:--|
| appiarium | 1.279210e+05 | 148918 | 47 | 2015-10-15 15:18:13 | 2018-05-31 05:27:20 | 98569935s (~3.12 years) |
| bioinf | 4.457530e+06 | 17037980 | 430 | 2012-12-04 14:58:03 | 2017-06-08 16:18:58 | 188859145s (~5.98 years) |
| bioinformatics | 1.551586e+03 | 16653 | 9 | 2013-04-02 17:09:06 | 2013-08-27 13:13:48 | 178569682s (~5.66 years) |
| cfa*** | 1.638713e+05 | 211218 | 1076 | 2018-02-28 13:33:46 | 2018-06-26 11:39:24 | 23667402s (~39.13 weeks) |
| cfl*** | 2.229588e+08 | 196372036 | 71808 | 2014-07-28 10:59:04 | 2018-09-27 11:31:23 | 136943484s (~4.34 years) |
| cfl*** | 3.406215e+07 | 29095330 | 19171 | 2012-10-17 09:32:33 | 2014-10-06 11:17:11 | 193025875s (~6.12 years) |

Pending time conditioned by group

Percentage efficiency conditioned by group

Wallclock load over time

Borrowing GenomicRanges infrastructure to visualize time intervals
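A sketch of the idea using IRanges (the infrastructure underlying GenomicRanges), with made-up job intervals: each job's run becomes an integer range of seconds, and coverage() counts how many jobs run concurrently at every instant.

```r
## Sketch: jobs as integer ranges of seconds since some origin;
## coverage() gives the concurrent job count over time
library(IRanges)
jobs <- IRanges(start = c(1, 10, 20),   # made-up job start times (s)
                end   = c(30, 25, 40))  # corresponding end times (s)
load <- coverage(jobs)                  # Rle of concurrent job counts
max(load)                               # peak concurrent load
## [1] 3
```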

Resources are being used increasingly heavily over time.

High usage example

A problem arises when pending time is greater than run time.

Coloured by Username
→ behavioural problem

Summary

Need to fully automate scheduler log analysis into accessible reports.

Urgent requirement for

  • Documentation - (how to use the scheduler)
  • Staff training - (troubleshooting, and what not to do)

within PFR using Powerplant compute resources

github:powerPlant/03_Openlava.Rmd

www.plantandfood.co.nz